 annotated image


Training and Evaluating Multimodal Word Embeddings with Large-scale Web Annotated Images

Neural Information Processing Systems

In this paper, we focus on training and evaluating effective word embeddings with both text and visual information. More specifically, we introduce a large-scale dataset with 300 million sentences describing over 40 million images crawled and downloaded from publicly available Pins (i.e., images with sentence descriptions uploaded by users) on Pinterest. This dataset is more than 200 times larger than MS COCO, the standard large-scale image dataset with sentence descriptions. In addition, we construct an evaluation dataset to directly assess the effectiveness of word embeddings in terms of finding semantically similar or related words and phrases. The word/phrase pairs in this evaluation dataset are collected from the click data of millions of users in an image search system, and thus contain rich semantic relationships. Based on these datasets, we propose and compare several Recurrent Neural Network (RNN) based multimodal (text and image) models. Experiments show that our model benefits from incorporating the visual information into the word embeddings, and that a weight sharing strategy is crucial for learning such multimodal embeddings.
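As a rough illustration of what "finding semantically similar or related words" can mean in practice, the sketch below ranks candidate words by cosine similarity to a query word. This is an assumption about how such an evaluation might be run, not the procedure from the paper; the function names and the three-dimensional toy vectors are made up for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, embeddings):
    """Return (word, similarity) pairs for all other words, most similar first."""
    q = embeddings[query]
    scored = [
        (word, cosine_similarity(q, vec))
        for word, vec in embeddings.items()
        if word != query
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (hypothetical values, not learned from any dataset).
toy_embeddings = {
    "dog": [1.0, 0.0, 0.2],
    "puppy": [0.9, 0.1, 0.3],
    "car": [0.0, 1.0, 0.1],
}

ranking = rank_by_similarity("dog", toy_embeddings)
```

With real embeddings, the ranked list for each query word or phrase could then be compared against the related pairs in an evaluation set.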




MAG-Nav: Language-Driven Object Navigation Leveraging Memory-Reserved Active Grounding

Zhang, Weifan, Li, Tingguang, Liu, Yuzhen

arXiv.org Artificial Intelligence

Visual navigation in unknown environments based solely on natural language descriptions is a key capability for intelligent robots. In this work, we propose a navigation framework built upon off-the-shelf Visual Language Models (VLMs), enhanced with two human-inspired mechanisms: perspective-based active grounding, which dynamically adjusts the robot's viewpoint for improved visual inspection, and historical memory backtracking, which enables the system to retain and re-evaluate uncertain observations over time. Unlike existing approaches that passively rely on incidental visual inputs, our method actively optimizes perception and leverages memory to resolve ambiguity, significantly improving vision-language grounding in complex, unseen environments. Our framework operates in a zero-shot manner, achieving strong generalization to diverse and open-ended language descriptions without requiring labeled data or model fine-tuning. Experimental results on Habitat-Matterport 3D (HM3D) show that our method outperforms state-of-the-art approaches in language-driven object navigation. We further demonstrate its practicality through real-world deployment on a quadruped robot, achieving robust and effective navigation performance.


Reviews: Training and Evaluating Multimodal Word Embeddings with Large-scale Web Annotated Images

Neural Information Processing Systems

The paper was clear and well written. The data set and the evaluation that was conducted could be useful to the community. However, the paper unfairly characterizes or omits some previous work, and was not clear enough about the limitations and biases of its evaluation strategy. These points detract from a paper that otherwise makes an interesting contribution. First, there is an implied criticism of WordSim-353 and MEN at the bottom of page 2 that they only contain similarity judgments at the word level. However, there is a large amount of work on learning phrase and sentence-level embeddings in the recent literature that overcomes these issues (see representative work by Mirella Lapata, Marco Baroni, Stephen Clarke, Richard Socher, among many others), which the paper does not mention.


A Good Foundation is Worth Many Labels: Label-Efficient Panoptic Segmentation

Vödisch, Niclas, Petek, Kürsat, Käppeler, Markus, Valada, Abhinav, Burgard, Wolfram

arXiv.org Artificial Intelligence

A key challenge for the widespread application of learning-based models for robotic perception is to significantly reduce the required amount of annotated training data while achieving accurate predictions. This is essential not only to decrease operating costs but also to speed up deployment time. In this work, we address this challenge for PAnoptic SegmenTation with fEw Labels (PASTEL) by exploiting the groundwork laid by visual foundation models. We leverage descriptive image features from such a model to train two lightweight network heads for semantic segmentation and object boundary detection, using very few annotated training samples. We then merge their predictions via a novel fusion module that yields panoptic maps based on normalized cut. To further enhance the performance, we utilize self-training on unlabeled images selected by a feature-driven similarity scheme. We underline the relevance of our approach by applying PASTEL to important robot perception use cases from autonomous driving and agricultural robotics. In extensive experiments, we demonstrate that PASTEL significantly outperforms previous methods for label-efficient segmentation even when using fewer annotations. The code of our work is publicly available at http://pastel.cs.uni-freiburg.de.


Semi-Self-Supervised Domain Adaptation: Developing Deep Learning Models with Limited Annotated Data for Wheat Head Segmentation

Ghanbari, Alireza, Shirdel, Gholamhassan, Maleki, Farhad

arXiv.org Artificial Intelligence

Precision agriculture involves the application of advanced technologies to improve agricultural productivity, efficiency, and profitability while minimizing waste and environmental impact. Deep learning approaches enable automated decision-making for many visual tasks. However, in the agricultural domain, variability in growth stages and environmental conditions, such as weather and lighting, presents significant challenges to developing deep learning-based techniques that generalize across different conditions. The resource-intensive nature of creating extensive annotated datasets that capture these variabilities further hinders the widespread adoption of these approaches. To tackle these issues, we introduce a semi-self-supervised domain adaptation technique based on deep convolutional neural networks with a probabilistic diffusion process, requiring minimal manual data annotation. Using only three manually annotated images and a selection of video clips from wheat fields, we generated a large-scale computationally annotated dataset of image-mask pairs and a large dataset of unannotated images extracted from video frames. We developed a two-branch convolutional encoder-decoder model architecture that uses both synthesized image-mask pairs and unannotated images, enabling effective adaptation to real images. The proposed model achieved a Dice score of 80.7% on an internal test dataset and a Dice score of 64.8% on an external test set, composed of images from five countries and spanning 18 domains, indicating its potential to develop generalizable solutions that could encourage the wider adoption of advanced technologies in agriculture.
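For readers unfamiliar with the metric, the Dice score for two binary masks is twice the size of their overlap divided by the sum of their sizes. The sketch below is a minimal pure-Python version of that standard formula; the function name, the nested-list mask representation, and the choice to return 1.0 for two empty masks are assumptions made for this example, not details from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as nested lists of 0/1."""
    pred_flat = [p for row in pred for p in row]
    target_flat = [t for row in target for t in row]

    # Pixels marked as foreground in both masks.
    intersection = sum(1 for p, t in zip(pred_flat, target_flat) if p and t)
    # Total foreground pixels across both masks.
    total = sum(1 for p in pred_flat if p) + sum(1 for t in target_flat if t)

    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement (a convention)
    return 2.0 * intersection / total


# Tiny made-up masks: 3 foreground pixels each, 2 of them shared.
pred_mask = [[1, 1, 0],
             [0, 1, 0]]
true_mask = [[1, 0, 0],
             [0, 1, 1]]

score = dice_score(pred_mask, true_mask)  # 2 * 2 / (3 + 3)
```

A reported score such as 80.7% corresponds to a value of 0.807 on this 0-to-1 scale, typically averaged over a test set.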


Decoupled Deep Neural Network for Semi-supervised Semantic Segmentation

Noh, Hyeonwoo

Neural Information Processing Systems

We propose a novel deep neural network architecture for semi-supervised semantic segmentation using heterogeneous annotations. Contrary to existing approaches that pose semantic segmentation as a single task of region-based classification, our algorithm decouples classification and segmentation, and learns a separate network for each task. In this architecture, labels associated with an image are identified by the classification network, and binary segmentation is subsequently performed for each identified label in the segmentation network. The decoupled architecture enables us to learn the classification and segmentation networks separately, based on training data with image-level and pixel-wise class labels, respectively. It effectively reduces the search space for segmentation by exploiting class-specific activation maps obtained from bridging layers. Our algorithm shows outstanding performance compared to other semi-supervised approaches, with far fewer training images with strong annotations, on the PASCAL VOC dataset.


One-Shot Segmentation of Novel White Matter Tracts via Extensive Data Augmentation

Liu, Wan, Lu, Qi, Zhuo, ZhiZheng, Liu, Yaou, Ye, Chuyang

arXiv.org Artificial Intelligence

Deep learning based methods have achieved state-of-the-art performance for automated white matter (WM) tract segmentation. In these methods, the segmentation model needs to be trained with a large number of manually annotated scans, which can be accumulated over time. When novel WM tracts, i.e., tracts not included in the existing annotated WM tracts, are to be segmented, additional annotations of these novel WM tracts need to be collected. Since tract annotation is time-consuming and costly, it is desirable to make only a few annotations of novel WM tracts for training the segmentation model, and previous work has addressed this problem by transferring the knowledge learned for segmenting existing WM tracts to the segmentation of novel WM tracts. However, accurate segmentation of novel WM tracts can still be challenging in the one-shot setting, where only one scan is annotated for the novel WM tracts. In this work, we explore the problem of one-shot segmentation of novel WM tracts. Since the annotated training data is extremely scarce in the one-shot setting, we build on the existing knowledge transfer framework and propose to further perform extensive data augmentation for the single annotated scan, producing synthetic annotated training data. We have designed several different strategies that mask out regions in the single annotated scan for data augmentation. Our method was evaluated on public and in-house datasets. The experimental results show that our method improves the accuracy of one-shot segmentation of novel WM tracts.
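The abstract does not spell out the masking strategies, so the following is only a generic sketch of the idea of generating many training samples from one annotated scan by masking out regions. It assumes a 2D image stored as nested lists, a single random rectangular block set to zero per copy, and an annotation that is reused unchanged; all names and parameters are hypothetical and the authors' actual strategies may differ.

```python
import random


def mask_random_block(image, block_h, block_w, rng):
    """Return a copy of `image` with one random block_h x block_w region zeroed."""
    height, width = len(image), len(image[0])
    top = rng.randrange(0, height - block_h + 1)
    left = rng.randrange(0, width - block_w + 1)

    masked = [row[:] for row in image]  # copy so the original scan is untouched
    for r in range(top, top + block_h):
        for c in range(left, left + block_w):
            masked[r][c] = 0
    return masked


def augment_single_scan(image, label, n_copies, block_h, block_w, seed=0):
    """Produce n_copies (masked image, label) pairs from one annotated scan."""
    rng = random.Random(seed)  # seeded for reproducibility
    return [
        (mask_random_block(image, block_h, block_w, rng), label)
        for _ in range(n_copies)
    ]


# Toy 4x4 "scan" of ones and a made-up annotation of the same shape.
scan = [[1] * 4 for _ in range(4)]
annotation = [[0, 1, 1, 0] for _ in range(4)]

samples = augment_single_scan(scan, annotation, n_copies=3, block_h=2, block_w=2)
```

Each synthetic pair could then be added to the training set alongside the original annotated scan.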


Self-mentoring: a new deep learning pipeline to train a self-supervised U-net for few-shot learning of bio-artificial capsule segmentation

Deleruyelle, Arnaud, Versari, Cristian, Klein, John

arXiv.org Artificial Intelligence

Background: Accurate segmentation of microscopic structures such as bio-artificial capsules in microscopy imaging is a prerequisite to the computer-aided understanding of important biomechanical phenomena. State-of-the-art segmentation performance is achieved by deep neural networks and related data-driven approaches. Training these networks from only a few annotated examples is challenging, while producing manually annotated images that provide supervision is tedious. Method: Recently, self-supervision, i.e., designing a neural pipeline providing synthetic or indirect supervision, has proved to significantly increase the generalization performance of models trained on few shots. The objective of this paper is to introduce one such neural pipeline in the context of micro-capsule image segmentation. Our method leverages the rather simple content of these images so that a trainee network can be mentored by a referee network which has been previously trained on synthetically generated pairs of corrupted/correct region masks. Results: Challenging experimental setups are investigated. They involve only 3 to 10 annotated images along with moderately large amounts of unannotated images. In a bio-artificial capsule dataset, our approach consistently and drastically improves accuracy. We also show that the learnt referee network is transferable to another Glioblastoma cell dataset and that it can be efficiently coupled with data augmentation strategies. Conclusions: Experimental results show that very significant accuracy increments are obtained by the proposed pipeline, leading to the conclusion that the self-supervision mechanism introduced in this paper has the potential to replace human annotations.